Transcribing Broadcast News: The LIMSI Nov96 Hub4 System
نویسنده
چکیده
In this paper we report on the LIMSI Nov96 Hub4 system for transcription of broadcast news shows. We describe the development work in moving from laboratory read speech data to realworld speech data in order to build a system for the ARPA Nov96 evaluation. Two main problems were addressed to deal with the continuous flow of inhomogenous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The speech recognizer makes use of continuous density HMMs with Gaussian mixture for acoustic modeling and n-gram statistics estimated on large text corpora. The base acoustic models were trained on the WSJ0/WSJ1 corpus, and adapted using MAP estimation with 35 hours of transcribed task-specific training data. The 65k language models are trained on 160 million words of newspaper texts and 132 million words of broadcast news transcriptions. The problem of segmenting the continuous stream of data was investigated using 10 MarketPlace shows. The overall word transcription error of the Nov96 partitioned evaluation test data was 27.1%.
منابع مشابه
Transcribing Broadcast News: The LIMSI Nov96
In this paper we report on the LIMSI Nov96 Hub4 system for transcription of broadcast news shows. We describe the development work in moving from laboratory read speech data to realworld speech data in order to build a system for the ARPA Nov96 evaluation. Two main problems were addressed to deal with the continuous flow of inhomogenous data. These concern the varied acoustic nature of the sign...
متن کاملTranscribing broadcast news shows
While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to r...
متن کاملDevelopment of The RU Hub4 system
This paper describes preliminary development of a broadcast news transcribing system for this year's Hub4 evaluation. The recognition system uses CROWNS (developed at RU for the 1995 Hub3 tasks) with several modi cations to handle the news programming task. Features such as model adaptation have been added to quickly provide acoustic models thought appropriate for the new task, even though the ...
متن کاملBroadcast news transcription in Mandarin
In this paper, our work in developing a Mandarin broadcast news transcription system is described. The main focus of this work is a port of the LIMSI American English broadcast news transcription system to the Chinese Mandarin language. The system consists of an audio partitioner and an HMM-based continuous speech recognizer. The acoustic models were trained on about 24 hours of data from the 1...
متن کاملImproved modeling and efficiency for automatic transcription of Broadcast News
Over the last few years, the DARPA-sponsored Hub4 continuous speech recognition evaluations have pushed speech recognition technology for the very interesting and difficult task of automatically transcribing broadcast news. In this paper, we report on our research and progress on this problem. We focus on individual techniques we developed, rather than on descriptions of our evaluation systems....
متن کامل